Script (Unicode)

In Unicode, a script is a collection of letters and other written signs used to represent textual information in one or more writing systems.[1] Some scripts support only one writing system and language, for example Armenian. Other scripts support many different writing systems; for example, the Latin script supports English, French, German, Italian, Vietnamese and Latin. Some languages have used more than one writing system, and therefore more than one script. Turkish, for example, was written in the Arabic script before the 20th century but changed to the Latin script in the early part of the 20th century. For a list of languages supported by each script, see the list of languages by writing system.

Unicode symbols complement the scripts: together, scripts and symbols cover all Unicode characters. The unified diacritical characters and unified punctuation characters frequently have the “common” or “inherited” script property. However, individual scripts often have their own punctuation and diacritics, so many scripts include not only letters but also diacritics and other marks, punctuation, numerals and even their own idiosyncratic symbols and space characters.

Unicode 6.0 includes 26 ancient and historic scripts and 67 modern scripts. Unicode is actively working on many more as indicated by its roadmap.

Definition and classification

When multiple languages use the same script, there are frequently some differences, particularly in diacritics and other marks. For example, Swedish and English both use the Latin script. However, Swedish includes the character ‘å’ (sometimes called a “Swedish O”), while English has no such character, nor does English use the diacritic combining ring above (U+030A) on any character. In general, languages that share a script share many of the same characters. Despite these peripheral differences, the Swedish and English writing systems are said to use the same Latin script. The Unicode abstraction of scripts is thus a basic organizing technique; the differences between alphabets or writing systems remain and are supported through Unicode’s flexible scripts, combining marks and collation algorithms.
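As a rough illustration of how the Swedish letter relates to a plain Latin base letter plus a diacritic, the sketch below uses Python's standard `unicodedata` module to split ‘å’ into its parts and join them again. The choice of Python and the variable names are assumptions made for this example only.

```python
import unicodedata

swedish_letter = "\u00e5"  # å, LATIN SMALL LETTER A WITH RING ABOVE

# NFD normalization splits a precomposed letter into base + combining marks.
decomposed = unicodedata.normalize("NFD", swedish_letter)
parts = [unicodedata.name(ch) for ch in decomposed]

for ch, name in zip(decomposed, parts):
    print(f"U+{ord(ch):04X} {name}")

# NFC normalization recombines the sequence into the single precomposed letter.
recomposed = unicodedata.normalize("NFC", decomposed)
```

Both forms represent the same Latin-script letter; only the encoding as one or two code points differs.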

Common and inherited scripts

Unicode can assign a character in the UCS to a single script only. However, many characters (those that are not part of a formal natural language writing system, or that are unified across many writing systems) may be used in more than one script. Examples include currency signs, symbols, numerals and punctuation marks. In these cases Unicode defines them as belonging to the common script (ISO 15924 code "Zyyy"). In all, Unicode has 6379 characters defined as "Common" script.

In addition, many diacritics and non-spacing combining characters may be applied to characters from more than one script. In these cases Unicode assigns them to the inherited script (ISO 15924 code "Zinh"), which means that they have the same script class as the base character with which they combine, and so in different contexts they may be treated as belonging to different scripts. For example, U+0308 ◌̈ COMBINING DIAERESIS may combine with either U+0065 e LATIN SMALL LETTER E to create a Latin "ë", or with U+0435 е CYRILLIC SMALL LETTER IE for the Cyrillic "ё". In the former case it inherits the Latin script of the base character, whereas in the latter case it inherits the Cyrillic script of the base character. In Unicode, 523 characters belong to the inherited script.
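The inheritance rule can be sketched in a few lines of Python. Python's standard library does not expose the script property, so this example assumes a tiny hand-written table (`SCRIPT_TABLE`) covering only the characters discussed here; the table and the helper `resolve_scripts` are hypothetical names for illustration, and a real program would read the full property data from the Unicode Character Database.

```python
import unicodedata

# Hand-written excerpt of the script property, for this example only.
# (Assumption: real code would load the complete data instead.)
SCRIPT_TABLE = {
    "\u0065": "Latin",      # e
    "\u0435": "Cyrillic",   # Cyrillic small letter ie
    "\u0308": "Inherited",  # combining diaeresis
    "\u0021": "Common",     # exclamation mark
}

def resolve_scripts(text):
    """Return (character, script) pairs, letting Inherited marks take the
    script of the nearest preceding non-Inherited character."""
    resolved = []
    base_script = "Unknown"  # used if a mark has no base before it
    for ch in text:
        script = SCRIPT_TABLE.get(ch, "Unknown")
        if script == "Inherited":
            script = base_script
        else:
            base_script = script
        resolved.append((ch, script))
    return resolved

latin = resolve_scripts("e\u0308!")
cyrillic = resolve_scripts("\u0435\u0308")
print(latin)
print(cyrillic)

# The same mark composes with either base letter under NFC normalization.
latin_composed = unicodedata.normalize("NFC", "e\u0308")
cyrillic_composed = unicodedata.normalize("NFC", "\u0435\u0308")
```

The same code point U+0308 is reported as Latin in the first string and as Cyrillic in the second, while the exclamation mark keeps its Common value.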

Ancient and historic scripts

Unicode includes 25 ancient scripts (out of use for a thousand years or more) and historic scripts (out of use for several hundred years).[2]

Script versus writing system

See also: phonemic and phonetic orthography.

"Writing system" is sometimes treated as a synonym for script. However it also can be used as the specific concrete writing system supported by a script. For example the Vietnamese writing system is supported by the Latin script. A writing system may also cover more than one script, for example the Japanese writing system makes use of the Han, Hiragana and Katakana scripts.

Most writing systems can be broadly divided into several categories: logographic, syllabic, alphabetic (or segmental), abugida, abjad and featural; however, all features of any of these may be found in any given writing system in varying proportions, often making it difficult to purely categorize a system. The term complex system is sometimes used to describe those where the admixture makes classification problematic.

Unicode supports all of these types of writing systems through its numerous scripts. Unicode also adds further properties to characters to help differentiate the various characters and the ways they behave within Unicode text processing algorithms.
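The relationship between a writing system and its scripts, as in the Japanese and Vietnamese examples in this section, can be illustrated with a small Python sketch. Because the standard library has no script lookup, the code guesses the script from the first word of each character's Unicode name. This is only a heuristic adopted for the example (it would misclassify many characters), and `NAME_PREFIX_TO_SCRIPT` and `guess_scripts` are hypothetical names.

```python
import unicodedata

# Assumption: the first word of a character's Unicode name hints at its
# script. This is a heuristic for illustration, not the real script property.
NAME_PREFIX_TO_SCRIPT = {
    "CJK": "Han",
    "HIRAGANA": "Hiragana",
    "KATAKANA": "Katakana",
    "LATIN": "Latin",
}

def guess_scripts(text):
    """Return the guessed scripts in order of first appearance."""
    seen = []
    for ch in text:
        prefix = unicodedata.name(ch, "").split(" ")[0]
        script = NAME_PREFIX_TO_SCRIPT.get(prefix, "Unknown")
        if script not in seen:
            seen.append(script)
    return seen

# Japanese sample mixing ideographs, hiragana and katakana.
japanese = "\u65e5\u672c\u8a9e\u306e\u30ab\u30bf\u30ab\u30ca"
# Vietnamese sample: Latin letters, one with two diacritics.
vietnamese = "Vi\u1ec7t"

japanese_scripts = guess_scripts(japanese)
vietnamese_scripts = guess_scripts(vietnamese)
print(japanese_scripts)
print(vietnamese_scripts)
```

The Japanese sample draws on three scripts, whereas the Vietnamese sample, despite its heavy use of diacritics, stays within one.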

Character categories within scripts

Unicode provides a general category property for each character, so in addition to belonging to a script, every character also has a general category. Scripts typically include letter characters: uppercase letters, lowercase letters and modifier letters. A few precomposed ligatures, such as ǲ (U+01F2), are categorized as titlecase letters. Such titlecase ligatures are all in the Latin and Greek scripts and are all compatibility characters, and therefore Unicode discourages their use by authors. It is unlikely that new titlecase letters will be added in the future.
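A short Python sketch shows how the general category and the compatibility status of such a titlecase ligature can be inspected with the standard `unicodedata` module. The variable names are assumptions of this example.

```python
import unicodedata

titlecase_dz = "\u01f2"  # LATIN CAPITAL LETTER D WITH SMALL LETTER Z

category = unicodedata.category(titlecase_dz)            # general category code
decomposition = unicodedata.decomposition(titlecase_dz)  # decomposition mapping

print(unicodedata.name(titlecase_dz), category, decomposition)

# NFKC normalization replaces the compatibility character with ordinary letters.
plain = unicodedata.normalize("NFKC", titlecase_dz)
```

The `<compat>` tag in the decomposition mapping marks the character as a compatibility character, and compatibility normalization replaces it with the two ordinary letters "D" and "z".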

Most writing systems do not differentiate between uppercase and lowercase letters. For those scripts, all letters are categorized as “other letter” or “modifier letter”. Ideographs such as Unihan ideographs are also categorized as “other letters”. A few scripts do differentiate between uppercase and lowercase, however: Latin, Cyrillic, Greek, Armenian, Georgian, and Deseret. Even in these scripts, some letters are neither uppercase nor lowercase.
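The difference between cased and uncased scripts can be seen by comparing general categories for a handful of letters. In the sketch below the script labels are assigned by hand (an assumption, since Python's standard library does not report scripts), and the Hebrew sample is an extra illustration of an uncased script chosen for this example.

```python
import unicodedata

# One sample letter per script; labels are hand-assigned for illustration.
samples = {
    "Latin": "A",
    "Cyrillic": "\u0416",  # ZHE
    "Greek": "\u03a9",     # OMEGA
    "Armenian": "\u0531",  # AYB
    "Han": "\u4e2d",       # ideograph
    "Hebrew": "\u05d0",    # ALEF
}

categories = {script: unicodedata.category(ch) for script, ch in samples.items()}

# A letter takes part in case mapping if lowercasing changes it.
has_case = {script: ch.lower() != ch for script, ch in samples.items()}

for script in samples:
    print(script, categories[script], has_case[script])
```

Here "Lu" is the code for uppercase letter and "Lo" the code for other letter; only the samples from cased scripts change when lowercased.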

Scripts can also contain characters of any other general category, such as marks (diacritic and otherwise), numbers (numerals), punctuation, separators (word separators such as spaces), symbols and non-graphical format characters. These are included in a particular script when they are unique to that script. Other such characters are generally unified and included in the punctuation or diacritic blocks. However, the bulk of characters in any script (other than the common and inherited scripts) are letters.
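The non-letter categories listed above can likewise be looked up with `unicodedata.category`. The sketch below picks one example character per category; the selection of characters and the names `examples` and `codes` are assumptions of this example.

```python
import unicodedata

# One example character for each non-letter category mentioned above.
examples = [
    "\u0301",  # combining acute accent (mark)
    "5",       # digit five (number)
    "!",       # exclamation mark (punctuation)
    " ",       # space (separator)
    "$",       # dollar sign (symbol)
    "\u200e",  # left-to-right mark (format character)
]

# Two-letter general category code for each character.
codes = [unicodedata.category(ch) for ch in examples]

for ch, code in zip(examples, codes):
    print(f"U+{ord(ch):04X} {code}")
```

The first letter of each code gives the major class (M, N, P, Z, S, C) and the second letter the subclass, for example "Mn" for a non-spacing mark.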

Table of scripts in Unicode

Unicode defines 97 script names (called "Alias" or "Property value alias"), based on the ISO 15924 list, that are used in Unicode 6.0.[3] These 97 include 25 ancient or historic scripts; the generic Zyyy Common script name (code for undetermined script); the generic Zinh Inherited script name for characters, such as diacritics, that are used in multiple scripts; and the general Zzzz Unknown script name (code for uncoded script). Script codes that are not used include, among others, Zsym (Symbols) and Zmth (Mathematical notation). These are not considered to be scripts in the Unicode sense.

See also

References